Abstract

As the scarce spectrum resource becomes overcrowded, cognitive radios (CRs) offer great
flexibility to improve spectrum efficiency by opportunistically accessing the authorized frequency
bands. One of the critical challenges in operating such radios in a network is how to efficiently
allocate transmission power and frequency resources among the secondary users (SUs) while satisfying
the quality-of-service (QoS) constraints of the primary users (PUs). In this paper, we focus on the
non-cooperative power allocation problem in cognitive wireless mesh networks (CogMesh) formed
by a number of clusters, taking energy efficiency into consideration. Because the SUs are selfish and
spontaneous, the problem is modeled as a stochastic learning process. We first extend
single-agent Q-learning to a multi-user context, and then propose a conjecture-based multi-agent
Q-learning algorithm that achieves the optimal transmission strategies with only private and incomplete
information. An intelligent SU performs Q-function updates based on its conjecture about the other
SUs' stochastic behaviors. This learning algorithm provably converges under certain restrictions that
arise during the learning procedure. Simulation experiments verify the performance of our
algorithm and demonstrate its effectiveness in improving energy efficiency.

Index Terms 

cognitive radio, cognitive wireless mesh networks, dynamic spectrum access, power allocation,
green communication, reinforcement learning, multi-agent Q-learning, conjecture.

I. Introduction 

In wireless communications, the radio frequency spectrum is the most precious resource,
and its use is regulated by governmental agencies on a long-term basis for large
geographical regions. Currently, the frequency bands are overcrowded and there is hardly any space
available for emerging wireless services. On the other hand, it is increasingly
evident that the fixed spectrum allocation policy has resulted in vastly under-utilized
spectrum holes. In November 2002, the Federal Communications Commission (FCC) published
a report showing that up to 70% of the allocated spectrum in certain measured geographical
areas of the United States is idle most of the time [fTTl. The limited available spectrum and
the inefficiency of spectrum usage necessitate a new communication paradigm that exploits the
existing wireless spectrum opportunistically [[T]. New approaches such as opportunistic spectrum
access (OSA) and dynamic spectrum access (DSA) have been proposed to bridge the enormous gulf in
time and space between regulation and the potential spectrum efficiency. CR is a promising
radio technique with the intrinsic capability to exploit these spectrum holes by sensing a
wide range of frequency bands, identifying currently unused spectrum blocks, and then
communicating in an opportunistic overlay manner [9], [16].

Research on CR has by now penetrated into different types of wireless networks
and covers almost every aspect of wireless communications [12], [19], [22], [25]. In this
paper, we focus on the cognitive wireless mesh networking scenario, named
CogMesh, as described in our previous work [2]. One of the critical challenges in deploying
CogMesh is how to design an efficient power allocation scheme for using the detected available
'spectrum holes' among the SUs while achieving interference-tolerable spectrum sharing with
the neighboring PUs. An efficient design maximizes the network performance subject to
protecting the PU transmissions and guaranteeing the signal-to-interference-plus-noise ratio (SINR) of the
SUs' ongoing connections. The transmission power is a double-edged sword. On one hand,
the higher the transmission power, the better the performance an SU can expect; on the other hand,
this better performance comes at the expense of not only causing higher interference to
both the PUs and the other SUs, but also increasing power consumption. In wireless networks,
the choice of transmission power fundamentally affects the performance of multiple protocol
layers. Recently, there has been much work on formulating the power allocation problem with
cross-layer design; the interested reader is referred to [10] and the references cited therein. These works, however,
assume that the users are cooperative, so the cross-layer design problem can be converted
into a system-wide optimal design problem. In our CogMesh scenario, cooperation among the
neighboring clusters helps to quantify the tradeoff; for example, if a central entity controls the
signaling in the network, it can update and broadcast the relevant information to all clusters and
their registered SUs.

However, it is more suitable to address the power allocation of CogMesh within a non-cooperative
game-theoretic framework, since there are conflicting interests among the clusters. The authors of [6] considered
non-cooperative energy-efficient spectrum access for a wireless CR ad hoc network by combining
an unconstrained optimization method with a constrained partitioning procedure. The work in [25] studied
distributed multi-channel power allocation for CR networks, with a strategy space that addresses both
the co-channel interference among SUs and the interference temperature regulation imposed by
PUs. In [22], Fan et al. proposed a price-based spectrum management scheme for CR networks.
Assuming that the SUs repeatedly negotiate their best transmission powers and spectrum, the SUs
announce prices that reflect their sensitivities to the current interference levels, and then adjust
their transmission powers. Our work originates from this non-cooperative problem, but we
propose a reinforcement learning algorithm to deal with it.

To formulate the non-cooperative game theoretically, we first model the self-interested
nature of power allocation in CogMesh. Generally, the concept of reward refers to the level of
satisfaction that a decision-maker receives in return for the action it performs. We construct a reward
function that takes energy efficiency into consideration. Based on the reward function, we model
the selfish behaviors as a non-cooperative power allocation game in which each SU maximizes its
own reward, regardless of what the other SUs do. In spite of this selfish nature, it is important
for the SUs to adapt to environmental changes, since energy efficiency is highly dependent on
environmental factors such as the primary users' behavior patterns and the traffic QoS requirements.

Therefore, we formulate the power allocation in CogMesh as a stochastic learning process [01,
[11], [20], [26] characterized by non-cooperative game playing among the local clusters, in which
the SUs are spontaneous, rational players with advanced learning capability, although they may
be selfish to some extent. We adopt the reinforcement learning framework known as
Q-learning [20]. As illustrated in Fig. 1, during the learning procedure, the SU
updates its strategy according to its experience with different actions, without explicitly modeling
the environment. Building on the single-agent Q-learning algorithm, a multi-agent Q-learning algorithm
is proposed to solve the problem of multi-user stochastic learning. One challenge in applying the
proposed approach to our scenario is that the SUs do not know the information of the other SUs,
because there is no cooperation among clusters. The networking environment is therefore non-stationary
for all SUs, and the convergence of the learning process may not be assured. To compensate for the lack
of mutual information exchange, the SUs form internal conjectures about how the other SUs
react to their present actions, using only local observations from direct interactions with the
CogMesh environment. Learning completes asymptotically by making appropriate use of
past experience. Essentially, our argument is that every rational SU has the motivation to improve
its performance even though it is selfish by nature.

Some work on reinforcement learning in CR networks has been reported in [5], [14],
where the studies focus on channel allocation, which is different from the topic of
this paper. Our work is the first to explore multi-agent Q-learning theory in
the stochastic non-cooperative power allocation game in CR networks, and in CogMesh in particular.
Compared to the previous work, this work provides the following three key insights:

• Firstly, for the non-cooperative power allocation game in CogMesh, we show that selfish
dynamics exist in the stochastic learning process.

• Secondly, we present a reinforcement learning algorithm in which the update rule is based on
each SU's own private and incomplete information, and in which the selfish learning dynamics converge.

• Thirdly, this paper also contributes to the general literature on multi-agent Q-learning.
Traditional multi-agent Q-learning algorithms in computer science (CS), such as Nash-Q [11] and CE-Q [8],
rely on full information about all agents in the environment. This is
impossible in wireless communication scenarios, since there exist conflicting interests
among the users. We therefore develop a conjecture-based multi-agent Q-learning algorithm.

The rest of this paper is organized as follows. In the next section, we briefly introduce the
single-agent Q-learning algorithm and its extension to multi-agent scenarios. In Section III,
we formulate the non-cooperative power allocation problem as a stochastic learning game, for
which we also present the design objective and the relevant challenges. In Section IV,
we propose a conjecture-based multi-agent Q-learning algorithm and investigate the convergence of the
proposed algorithm. Numerical results are presented in Section V,
verifying the validity and efficiency of the proposed algorithm. Finally, Section
VI concludes the paper.

II. Preliminaries of Q-Learning Algorithm 

In this section, we first give a brief introduction to the single-agent Q-learning algorithm, and
then extend the algorithm to multi-agent scenarios. Our description adopts standard notation
and terminology from the framework of reinforcement learning [17], [11], [20].

A. Single-agent Q-learning 

The environment with which an agent interacts is typically formulated as a finite-state Markov
Decision Process (MDP). Let $S$ be a discrete set of environment states, and $A$ a discrete set of
actions. At each step $t$, the agent senses the state $s^t = s \in S$ and selects an action $a^t = a \in A$
to perform. As a result, the environment makes a transition to the new state $s^{t+1} = s' \in S$
according to the probability $T_{ss'}(a)$ and thereby generates a feedback (reward) $r^t = r(s^t, a^t) \in \mathbb{R}$
that is passed to the agent. This process is repeated indefinitely.

The task of the agent is then to learn an optimal policy $\pi^*(s)$ for each $s$, which maximizes
the total expected discounted reward over an infinite number of steps,

\[ V^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \beta^t r\big(s^t, \pi(s^t)\big) \,\Big|\, s^0 = s \right], \tag{1} \]

where $s^0$ is the initial state, $E$ denotes the expectation operator and $\beta \in [0, 1)$ is the discount
factor. We can rewrite Equation (1) as [20]

\[ V^{\pi}(s) = E[r(s, \pi(s))] + \beta \sum_{s' \in S} T_{ss'}(\pi(s)) V^{\pi}(s'). \]

It has been proven that the optimal policy satisfies the Bellman optimality equation

\[ V^*(s) = V^{\pi^*}(s) = \max_{a \in A} \left\{ E[r(s, a)] + \beta \sum_{s' \in S} T_{ss'}(a) V^*(s') \right\}. \tag{2} \]

One attractive feature of Q-learning is that it assumes no a priori knowledge about the state
transition probabilities $T_{ss'}(a)$. We define the quantity maximized on the right-hand side of Equation (2) as

\[ Q^*(s, a) = Q^{\pi^*}(s, a) = E[r(s, a)] + \beta \sum_{s' \in S} T_{ss'}(a) V^{\pi^*}(s'). \tag{3} \]

Then by Equation (2),

\[ V^*(s) = \max_{a \in A} Q^*(s, a). \]

The optimal state value function $V^*(s)$ can hence be obtained from $Q^*(s, a)$. In turn,
Equation (3) may be expressed as

\[ Q^*(s, a) = E[r(s, a)] + \beta \sum_{s' \in S} T_{ss'}(a) \max_{b \in A} Q^*(s', b). \]

In Q-learning, the agent tries to find $Q^*(s, a)$ recursively using the information $(s, a, r^t, s')$.
The updating rule is

\[ Q^{t+1}(s, a) = (1 - \alpha^t) Q^t(s, a) + \alpha^t \left[ r^t + \beta \max_{b \in A} Q^t(s', b) \right], \]

where $\alpha^t \in [0, 1)$ is the learning rate. Assuming that each action is executed in each state an
infinite number of times and that the learning rate $\alpha^t$ is decayed appropriately,
$Q^t(s, a)$ will finally converge to $Q^*(s, a)$ with probability (w.p.) 1 as $t \to \infty$ [23].
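
The single-agent updating rule can be illustrated with a minimal tabular sketch. The two-state, two-action environment (`step`), the uniform exploration and the learning-rate schedule `1 / (1 + visits)` are assumptions made purely for illustration; none of them comes from the paper.

```python
import random

def q_update(Q, s, a, r, s_next, alpha, beta):
    """One tabular step: Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + beta max_b Q(s',b)]."""
    best_next = max(Q[s_next])
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * (r + beta * best_next)
    return Q[s][a]

def step(s, a, rng):
    """Hypothetical toy MDP: action a leads to state a with probability 0.9."""
    s_next = a if rng.random() < 0.9 else 1 - a
    r = 1.0 if s_next == 1 else 0.0   # only state 1 pays a reward
    return r, s_next

def run(num_steps=5000, beta=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    visits = [[0, 0], [0, 0]]
    s = 0
    for _ in range(num_steps):
        a = rng.randrange(2)                  # uniform exploration visits every (s, a) pair
        r, s_next = step(s, a, rng)
        visits[s][a] += 1
        alpha = 1.0 / (1.0 + visits[s][a])    # decaying learning rate, always in [0, 1)
        q_update(Q, s, a, r, s_next, alpha, beta)
        s = s_next
    return Q

Q = run()
```

In this toy environment the learned table should rank action 1 above action 0 in both states, since action 1 is the one that usually leads to the rewarding state.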

B. Multi-agent Q-learning

Consider an $N$-agent game in which each agent is equipped with a standard Q-learning algorithm and
learns without any cooperation with the other agents. The received rewards and state transitions,
however, depend on the joint actions of all agents. Let $S_i$ be a discrete set of environment
states and $A_i$ a discrete set of actions relevant to agent $i$. At each step $t$, the agent senses the
environment state $s_i^t = s_i \in S_i$, then independently chooses action $a_i \in A_i$. Consequently, agent
$i$ receives $r_i^t = r_i(s_i, a_1, \ldots, a_N)$ and the environment transits to a new state $s_i^{t+1} = s_i' \in S_i$
according to the fixed probabilities $T_{s_i s_i'}(a_1, \ldots, a_N)$. Note that $r_i^t$ and $T_{s_i s_i'}$ are defined over the
joint actions $(a_1, \ldots, a_N)$.

III. Problem Formulation

In this paper, we consider a non-cooperative power allocation system in which each SU,
behaving as a learning agent, adjusts its transmission power level based on the reward received
from the self-interested CogMesh environment in order to arrive at the optimal strategy. The key component
for describing the selfish interest is the reward function. In this section, we first present
a reward model for power allocation that takes energy efficiency into account. Based
on the reward model, we formalize the power allocation problem as a non-cooperative
game. We then convert the non-cooperative game into a stochastic learning
process. Finally, we discuss the design objective and highlight the challenging issues.

A. Reward Function and Non-cooperative Power Allocation Game 

We consider a generalized CogMesh networking example consisting of several specific PU
links (i.e., primary transmitter PT and primary receiver PR) and one CR network formed by
a set $\mathcal{N} = \{1, \ldots, N\}$ of SU links spatially distributed in non-overlapping clusters (see Fig.
2). Owing to opportunistic spectrum access, they coexist in the same area and simultaneously share the same
frequency band of bandwidth $W$. We assume that each user operates in a
half-duplex manner, which means that it cannot receive any signal while it is transmitting, and vice
versa. The total interference plus noise measured by any SU includes the PU-to-SU interference,
the SU-to-SU interference, and the additive white Gaussian noise (AWGN). An SU denotes a CR
link consisting of a pair of CR nodes, and we use the terms SU and CR link interchangeably.

We denote the transmission power and the signal-to-interference-plus-noise ratio (SINR) of
SU $i$ by $p_i$ ($p_i^{\min} \le p_i \le p_i^{\max}$) and $\gamma_i$, respectively. The other SUs' transmit power vector is
denoted by $\mathbf{p}_{-i} = (p_1, \ldots, p_{i-1}, p_{i+1}, \ldots, p_N)$. Assuming that the channel gains evolve slowly with
respect to the SINR evolution, the SINR of SU $i$ in this problem formulation is given by

\[ \gamma_i(p_i, \mathbf{p}_{-i}) = \frac{h_{ii} p_i}{\sigma + \phi_i^{\mathrm{PU}} + \sum_{j \in \mathcal{N}, j \ne i} h_{ji} p_j}, \]

where $h_{ji}$ is the channel gain between the transmitter of SU link $j$ and the receiver of SU link $i$,
$\phi_i^{\mathrm{PU}}$ denotes the PU-to-SU interference at the receiver of SU link $i$, and $\sigma$ is the AWGN power.

The goal of power allocation within the CogMesh framework is to ensure that no SU's SINR
falls below its threshold $\gamma_i^*$, chosen to guarantee adequate QoS, i.e.,

\[ \gamma_i \ge \gamma_i^*, \quad \forall i \in \mathcal{N}. \]

Furthermore, opportunistic spectrum access enables the SUs to transmit with spectrum and coverage
overlapping those of the PUs, as long as the performance degradation induced on the
PUs is tolerable. In this paper, we consider the following power mask constraint as in [22]; that
is, the transmission power level of SU $i$ over the detected frequency band is constrained by

\[ p_i \le p_{\mathrm{mask}}, \quad \forall i \in \mathcal{N}, \tag{4} \]

where $p_{\mathrm{mask}}$ is the power mask, which is given a priori. Such a hardware-based power mask
is easier to manipulate at the design stage from a practical point of view. Because the
number of active SUs that share the same spectrum with the PUs varies in time and space, it is
impossible to design the device to account for a 'neighbor-dependent' power mask, especially in
non-cooperative CogMesh networking.

To implement non-cooperative power allocation in CogMesh, one of the most important
concerns is the definition of the received reward. As mentioned above, a higher SINR at the
receiver generally results in a lower bit error rate and hence a higher throughput. However,
achieving a high SINR requires the SU to transmit at a high power level, which in turn consumes
more power and increases the interference to other users,
especially the PUs. Accordingly, we choose the average number of bits received correctly per
unit of energy consumption as the reward function to quantify the tradeoff (as in [15]), since this
is a practical and meaningful metric of energy efficiency:

\[ \mathcal{R}_i(p_i, \mathbf{p}_{-i}) = \frac{W \log_2\left(1 + \gamma_i / \Gamma\right)}{p_i}. \]

Here, $\Gamma$ is the gap between uncoded M-QAM and the capacity, minus the coding gain. We
assume that the CR transmitters use variable-rate M-QAM with a bounded probability of symbol
error and trellis coding with a nominal coding gain.
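
As a hedged illustration of the SINR and the bits-per-energy reward described above, the following sketch evaluates both for a hypothetical two-link network. The gain matrix, interference and noise values are made-up numbers, the bandwidth is normalized to 1, and the reward is assumed to take the form W log2(1 + SINR / gap) / p, i.e. throughput under an M-QAM gap divided by transmit power.

```python
import math

def sinr(i, powers, gains, pu_interference, noise_power):
    """SINR of SU link i; gains[j][i] is the gain from transmitter j to receiver i."""
    su_interference = sum(
        gains[j][i] * powers[j] for j in range(len(powers)) if j != i
    )
    return gains[i][i] * powers[i] / (noise_power + pu_interference[i] + su_interference)

def reward(i, powers, gains, pu_interference, noise_power, bandwidth, gap):
    """Assumed energy-efficiency reward: bits delivered per unit of transmit energy."""
    gamma_i = sinr(i, powers, gains, pu_interference, noise_power)
    return bandwidth * math.log2(1.0 + gamma_i / gap) / powers[i]

# Hypothetical example values (not taken from the paper).
gains = [[1.0, 0.1], [0.2, 1.0]]
pu = [0.01, 0.01]
noise = 0.01

gamma_0 = sinr(0, [0.1, 0.2], gains, pu, noise)
eff_low_power = reward(0, [0.1, 0.2], gains, pu, noise, bandwidth=1.0, gap=1.0)
eff_high_power = reward(0, [0.2, 0.2], gains, pu, noise, bandwidth=1.0, gap=1.0)
```

With the other link's power held fixed, doubling the transmit power raises the SINR but lowers the bits-per-energy reward in this example, which is the tradeoff the reward function is meant to capture.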

Considering the power mask constraint (4) together with the maximum transmission power
level $p_i^{\max}$, the action set of SU $i$ is $\mathcal{P}_i = [p_i^{\min}, \bar{p}_i^{\max}]$, where $\bar{p}_i^{\max} = \min\{p_i^{\max}, p_{\mathrm{mask}}\}$.
We formulate the SUs' selfish behaviors using non-cooperative game theory, with the game defined by the
tuple $\mathcal{G} = (\mathcal{N}, \mathcal{P}, \{\mathcal{R}_i(\cdot)\})$, where $\mathcal{P} = \mathcal{P}_1 \times \cdots \times \mathcal{P}_N$ is the action space available to all SUs.
Formally, the non-cooperative power allocation game in CogMesh can be defined by

\[ \max_{p_i \in \mathcal{P}_i} \mathcal{R}_i(p_i, \mathbf{p}_{-i}) \quad \text{s.t.} \quad \gamma_i \ge \gamma_i^*, \]

for all $i \in \mathcal{N}$. The solution of this game can be derived in the sense of Nash Equilibrium (NE).

Definition 1: A transmission power vector $(p_i^*, \mathbf{p}_{-i}^*)$ is an NE if, for each SU $i$,

\[ \mathcal{R}_i(p_i^*, \mathbf{p}_{-i}^*) \ge \mathcal{R}_i(p_i, \mathbf{p}_{-i}^*), \quad \text{for all } p_i \in \mathcal{P}_i. \]

The following proposition gives a sufficient condition for the existence of an NE in the game $\mathcal{G}$.

Proposition 1: For any given $p_{\mathrm{mask}}$ value, there is an NE in game $\mathcal{G}$ if, for $i = 1, \ldots, N$:

1) The action set $\mathcal{P}_i$ is a closed and bounded convex set.

2) The reward function $\mathcal{R}_i(p_i, \mathbf{p}_{-i})$ is continuous in $(p_i, \mathbf{p}_{-i})$ and quasi-concave in $p_i$.

B. Stochastic Power Allocation by Multi-agent Q-learning 

The wireless communication system can be considered as a discrete-time system. In this
section, we model the SUs' selfish behaviors within the stochastic game framework, in which every
SU plays the role of an intelligent agent. To be compatible with the multi-agent Q-learning
framework, we first discretize the continuous action set $\mathcal{P}_i$ into $m_i + 1$ power levels $p_i(a_i)$, and
designate $a_i \in A_i = \{0, \ldots, m_i\}$ as SU $i$'s action. It is then necessary to identify the
environment state, the associated reward and the next state.

1) State: Since there is no cooperation among the SUs, the state should be defined based on
the local observation of the environment. At time slot $t$, we can express the state $s_i^t$ observed
by SU $i$ as

\[ s_i^t = \big(i, I_i, p_i(a_i)\big)^t. \]

Herein, $I_i \in \{0, 1\}$ specifies whether SU $i$'s SINR $\gamma_i$ at the corresponding receiver is
above or below its threshold $\gamma_i^*$. That is,

\[ I_i = \begin{cases} 1, & \text{if } \gamma_i \ge \gamma_i^*, \\ 0, & \text{otherwise.} \end{cases} \]

2) Reward: The reward $\mathcal{R}_i(s_i, a_i, \mathbf{a}_{-i}) = \mathcal{R}_i(a_i, \mathbf{a}_{-i})$ of SU $i$ in state $s_i$ is the immediate
return due to the execution of action $a_i$ when all the other SUs choose actions
$\mathbf{a}_{-i} = (a_1, \ldots, a_{i-1}, a_{i+1}, \ldots, a_N)$. Specifically, it is the return of choosing power level $p_i(a_i)$ in state $s_i$ to
meet the transmission QoS requirement as well as to achieve power efficiency.

3) Next State: From the definition of the state $s_i^t$ in 1), we can see that the state
transition from $s_i^t$ to $s_i^{t+1}$ is determined by the stochastic power allocations of all SUs.
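
A minimal sketch of how an SU might build its discrete action set and its local state is given below. The evenly spaced power grid is an assumption (the text only requires a finite set of levels indexed by the action), and the helper names `power_levels` and `observe_state` are hypothetical.

```python
def power_levels(p_min, p_max, m):
    """Return m + 1 power levels for actions 0..m (uniform spacing is an assumption)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    step = (p_max - p_min) / m
    return [p_min + a * step for a in range(m + 1)]

def observe_state(i, action, levels, gamma_i, gamma_threshold):
    """Local state of SU i: (SU index, QoS indicator, selected power level)."""
    indicator = 1 if gamma_i >= gamma_threshold else 0
    return (i, indicator, levels[action])

levels = power_levels(0.1, 0.5, 4)                       # five levels between 0.1 and 0.5
state_ok = observe_state(0, 2, levels, gamma_i=3.0, gamma_threshold=2.0)
state_bad = observe_state(0, 2, levels, gamma_i=1.0, gamma_threshold=2.0)
```

The state depends only on quantities that SU i can measure itself, namely its own power level and the SINR reported by its receiver, which matches the requirement that no information is exchanged between clusters.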

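Thus the non-cooperative game $\mathcal{G}$ is converted to the discrete form $\mathcal{G}' = (\mathcal{N}, \{A_i\}, \{\mathcal{R}_i\})$,
i.e., each SU chooses the strategy $\pi_i(s_i)$ independently to maximize its total expected discounted
reward

\[ \max_{\pi_i \in \Pi_i} E\left[ \sum_{t=0}^{\infty} \beta^t \mathcal{R}_i\big(s_i^t, \pi_i(s_i^t), \pi_{-i}(s_{-i}^t)\big) \,\Big|\, s_i^0 = s_i \right], \]

where $\pi_{-i}(s_{-i}^t) = (\pi_1(s_1^t), \ldots, \pi_{i-1}(s_{i-1}^t), \pi_{i+1}(s_{i+1}^t), \ldots, \pi_N(s_N^t))$ and $\Pi_i$ is the set of strategies
available to SU $i$. A strategy $\pi_i$ of SU $i$ in state $s_i$ is defined to be a probability vector $\pi_i(s_i) =
[\pi_i(s_i, 0), \ldots, \pi_i(s_i, m_i)]$, where $\pi_i(s_i, a_i)$ is the probability with which SU $i$ chooses
action $a_i$ when in state $s_i$. For the case of completely exact information about the other SUs'
strategies $\pi_{-i} = (\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_N)$, we define the total expected discounted reward of
SU $i$ over an infinite number of time slots as

\[ V_i(s_i, \pi_i, \pi_{-i}) = E\left[ \sum_{t=0}^{\infty} \beta^t \mathcal{R}_i\big(s_i^t, \pi_i(s_i^t), \pi_{-i}(s_{-i}^t)\big) \,\Big|\, s_i^0 = s_i \right] \]
\[ = E\big[\mathcal{R}_i(s_i, \pi_i(s_i), \pi_{-i}(s_{-i}))\big] + \beta \sum_{s_i'} T_{s_i s_i'}\big(\pi_i(s_i), \pi_{-i}(s_{-i})\big) V_i(s_i', \pi_i, \pi_{-i}), \]

where $T_{s_i s_i'}(\cdot)$ is the state transition probability, and

\[ E\big[\mathcal{R}_i(s_i, \pi_i(s_i), \pi_{-i}(s_{-i}))\big] = \sum_{a_1 \in A_1} \cdots \sum_{a_N \in A_N} \mathcal{R}_i(a_i, \mathbf{a}_{-i}) \prod_{j=1}^{N} \pi_j(s_j, a_j). \]

In the stochastic power allocation game, each SU behaves as a learning agent whose task is to
learn the optimal strategy $\pi_i^*(s_i)$ ($i = 1, \ldots, N$) for each state $s_i$. Let $\pi_{-i}^* = (\pi_1^*, \ldots, \pi_{i-1}^*, \pi_{i+1}^*, \ldots, \pi_N^*)$.

Definition 2: A tuple of strategies $(\pi_i^*, \pi_{-i}^*)$ is an NE if, for each SU $i$,

\[ V_i(s_i, \pi_i^*, \pi_{-i}^*) \ge V_i(s_i, \pi_i, \pi_{-i}^*), \quad \text{for all } \pi_i \in \Pi_i. \]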

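Every finite strategic-form game has a mixed-strategy equilibrium; that is, there always
exists an NE in our game formulation of stochastic power allocation. The optimal strategy
satisfies the Bellman optimality equation, that is, for secondary user $i$,

\[ V_i(s_i, \pi_i^*, \pi_{-i}^*) = \max_{a_i \in A_i} \left\{ E\big[\mathcal{R}_i(s_i, a_i, \pi_{-i}^*(s_{-i}))\big] + \beta \sum_{s_i'} T_{s_i s_i'}\big(a_i, \pi_{-i}^*(s_{-i})\big) V_i(s_i', \pi_i^*, \pi_{-i}^*) \right\}, \tag{5} \]

where

\[ E\big[\mathcal{R}_i(s_i, a_i, \pi_{-i}^*(s_{-i}))\big] = \sum_{\mathbf{a}_{-i}} \mathcal{R}_i(s_i, a_i, \mathbf{a}_{-i}) \prod_{j=1, j \ne i}^{N} \pi_j^*(s_j, a_j). \]

We define the optimal Q-value $Q_i^*$ of SU $i$ as the current expected reward plus its future rewards
when all SUs follow the Nash equilibrium strategies, that is,

\[ Q_i^*(s_i, a_i) = E\big[\mathcal{R}_i(s_i, a_i, \pi_{-i}^*(s_{-i}))\big] + \beta \sum_{s_i'} T_{s_i s_i'}\big(a_i, \pi_{-i}^*(s_{-i})\big) V_i(s_i', \pi_i^*, \pi_{-i}^*). \tag{6} \]

Combining Equations (5) and (6), it is easy to get

\[ Q_i^*(s_i, a_i) = E\big[\mathcal{R}_i(s_i, a_i, \pi_{-i}^*(s_{-i}))\big] + \beta \sum_{s_i'} T_{s_i s_i'}\big(a_i, \pi_{-i}^*(s_{-i})\big) \max_{b_i \in A_i} Q_i^*(s_i', b_i). \]

The multi-agent Q-learning process tries to find $Q_i^*(s_i, a_i)$ recursively using the
information $(a_i, s_i, s_i', \pi_i^t)$ ($i = 1, \ldots, N$), where $s_i = s_i^t$ and $s_i' = s_i^{t+1}$ are the states at
time slots $t$ and $t + 1$, respectively, and $a_i$ and $\pi_i^t$ are SU $i$'s action taken at the end of time
slot $t$ and its transmission strategy during time slot $t$. The proposed multi-agent Q-learning rule
is

\[ Q_i^{t+1}(s_i, a_i) = (1 - \alpha^t) Q_i^t(s_i, a_i) + \alpha^t \left\{ \mathcal{R}_i(s_i, a_i, \mathbf{a}_{-i}) \prod_{j=1, j \ne i}^{N} \pi_j^t(s_j, a_j) + \beta \max_{b_i \in A_i} Q_i^t(s_i', b_i) \right\}, \tag{7} \]

where $\alpha^t \in [0, 1)$ is the learning rate.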
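
The multi-agent rule in Equation (7) can be sketched as follows, under the assumption that states and actions are encoded as integer indices and that each strategy is stored as a nested list `strategies[j][s][a]`. The function names and the data layout are illustrative choices, not part of the paper.

```python
def joint_probability(strategies, states, actions, exclude):
    """Product over all SUs j != exclude of pi_j(s_j, a_j)."""
    prob = 1.0
    for j, pi_j in enumerate(strategies):
        if j != exclude:
            prob *= pi_j[states[j]][actions[j]]
    return prob

def multi_agent_q_update(Q_i, i, states, actions, next_state, reward,
                         strategies, alpha, beta):
    """One update of SU i's Q-table, weighting the reward by the others' action probabilities."""
    s_i, a_i = states[i], actions[i]
    weight = joint_probability(strategies, states, actions, exclude=i)
    target = reward * weight + beta * max(Q_i[next_state])
    Q_i[s_i][a_i] = (1.0 - alpha) * Q_i[s_i][a_i] + alpha * target
    return Q_i[s_i][a_i]

# Hypothetical two-SU example with two states and two actions each.
strategies = [
    [[0.5, 0.5], [0.5, 0.5]],   # SU 0
    [[0.2, 0.8], [0.6, 0.4]],   # SU 1
]
Q_0 = [[0.0, 0.0], [1.0, 3.0]]
new_value = multi_agent_q_update(
    Q_0, i=0, states=[0, 1], actions=[1, 0], next_state=1,
    reward=2.0, strategies=strategies, alpha=0.5, beta=0.9,
)
```

Note that `joint_probability` reads the other SUs' strategies directly. This is exactly the information that is unavailable in a non-cooperative setting, so the sketch only shows what the rule would require.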

An intuitive explanation for Equation (7) is that, once the power level $p_i(a_i)$ is selected, the
corresponding Q-value is updated by combining the old value and the
new expected reward. More specifically, given the probabilities $\{\pi_j^t(s_j, a_j)\}_{j \in \mathcal{N}, j \ne i}$ of the other
SUs choosing power levels $\{p_j(a_j)\}_{j \in \mathcal{N}, j \ne i}$, if SU $i$ achieves a higher reward $\mathcal{R}_i(a_i, \mathbf{a}_{-i})$ when
selecting power level $p_i(a_i)$, then the $Q_i(s_i, a_i)$-value is increased by a larger amount. Notice that
the proposed multi-agent Q-learning algorithm needs not only SU $i$'s own information but also
the strategies of the other SUs. However, in this paper, the strategy is myopic since we assume
that there is no cooperation among the SUs.

C. Design Objective and Challenging Issues

Our aim is to design stochastic power allocation in non-cooperative CogMesh with self-interested
SUs. The reward of each SU is a function of the joint actions of all SUs. Accordingly,
we apply the multi-agent Q-learning approach to model the interaction among the SUs' strategy
decisions. Rather than choosing the best transmission power level, an SU in the stochastic learning
process chooses the best mixed strategy. The problem is challenging because an
SU may not be aware of the following two things during the learning process:

1) the number of SUs coexisting in the system;

2) strategies available to the other SUs. 

The SU can only observe its own information, such as the environment state, the strategy, and 
the received rewards. 

From Equation (7), in order to learn the optimal strategy, SU $i$ needs to know not only its
own strategy but also the other SUs' transmission strategies $\pi_j^t$ ($j \in \mathcal{N} \setminus \{i\}$). From this
discussion, we see that the multi-agent Q-learning algorithm obtained above cannot solve the power
allocation problem directly, because no SU can observe the competing SUs' private information
in a non-cooperative CogMesh networking scenario. Therefore, a challenging problem arises:
how can we design a stochastic non-cooperative power allocation scheme that guarantees that the SUs
learn the optimal strategies with only private and incomplete information?

IV. Stochastic Power Allocation with Conjecture-based Multi-agent Q-learning Approach

As discussed in the previous section, the main disadvantage of the derived multi-agent Q-learning
algorithm is its requirement to account for the competing SUs' strategy information. In
non-cooperative power control, however, the SUs only know what reward they are getting from
their current strategy. In this section, we propose a stochastic non-cooperative power allocation
scheme with private and incomplete information. To make the multi-agent Q-learning algorithm
workable in a non-cooperative CogMesh networking environment, the SU needs to
conjecture the other SUs' strategy decisions without any coordination among the local clusters
[24]. This motivates conjecture-based multi-agent Q-learning.

A. Individual Behavior and Evolution

The goal of this paper is to design a simple non-cooperative power allocation algorithm that
requires quite limited information exchange among the SUs. From a game-theoretic point of view,
the NE reached is based on assumptions about what knowledge the SUs possess, and it assumes
that no SU's strategy will change at the NE. Therefore, the SUs operating at the NE can
be viewed as learning agents behaving optimally with respect to their conjectures about the
strategies of the other SUs.

From Equation (7), we can see that SU $i$'s current expected reward depends on both its own
decision and the other SUs' transmission policies. However, in the non-cooperative scenario, it is
hard for the SUs to obtain exact information about the transmission strategies of their competitors.

We define $c_i^t(s_i, a_i) = \prod_{j=1, j \ne i}^{N} \pi_j^t(s_j, a_j)$ for SU $i$ in time slot $t$ to be the conjecture representing
the aggregated effect on the $Q_i^{t+1}(s_i, a_i)$-value when all the other SUs choose actions $\mathbf{a}_{-i}$ according
to their corresponding strategies $\pi_{-i}^t(s_{-i}) = (\pi_1^t(s_1), \ldots, \pi_{i-1}^t(s_{i-1}), \pi_{i+1}^t(s_{i+1}), \ldots, \pi_N^t(s_N))$.
We assume that $c_i^t(s_i, a_i)$ is the only information that SU $i$ has about the contention
level of the entire CogMesh networking environment, because it is a metric that SU $i$ can
easily calculate from local observations.

Specifically, from SU $i$'s viewpoint, the probability of experiencing environment state $s_i'$ is
$\zeta_i = \pi_i^t(s_i, a_i) c_i^t(s_i, a_i)$. In other words, the probability that SU $i$ receives reward $\mathcal{R}_i(s_i, a_i, \mathbf{a}_{-i})$ is
$\zeta_i$. Let $n_i$ denote the number of time slots between any two consecutive slots in which SU $i$ achieves the
same reward $\mathcal{R}_i(s_i, a_i, \mathbf{a}_{-i})$; then the $n_i$ are independent and identically distributed (i.i.d.) with parameter $\zeta_i$.
Thereupon, we have $\zeta_i = 1/(1 + \bar{n}_i)$, where $\bar{n}_i$ is the mean value of $n_i$ and can be computed locally
by SU $i$ itself through the observation of its reward history. Since SU $i$ knows its own
transmission strategy $\pi_i^t(s_i, a_i)$, it can estimate $c_i^t(s_i, a_i)$ through $c_i^t(s_i, a_i) = 1/[(1 + \bar{n}_i) \pi_i^t(s_i, a_i)]$.
Note that the action available to SU $i$ is to choose the transmission power level according
to the strategy $\pi_i^t(s_i)$. We can express SU $i$'s conjecture $c_i^t(s_i, a_i)$ as a function of its own
transmission strategy. A simple method is to use the linear model, i.e.,

\[ c_i^t(s_i, a_i) = \bar{c}_i(s_i, a_i) - \omega_i^{s_i, a_i} \left[ \pi_i^t(s_i, a_i) - \bar{\pi}_i(s_i, a_i) \right], \tag{8} \]

where the so-called reference points [13], $\bar{c}_i(s_i, a_i)$ and $\bar{\pi}_i(s_i, a_i)$, are a specific conjecture and
probability, and $\omega_i^{s_i, a_i}$ is a positive scalar. In this paper, the reference points are considered
exogenously given and common knowledge. That is, SU $i$ assumes that the other SUs will
observe its deviation from its reference point $\bar{\pi}_i(s_i, a_i)$ and that the aggregate effect deviates from
the reference point $\bar{c}_i(s_i, a_i)$ by a quantity proportional to the deviation $\pi_i^t(s_i, a_i) - \bar{\pi}_i(s_i, a_i)$.

Among the different choices for capturing the impact of the competing SUs as a function of
an SU's own strategy, the linear model in Equation (8) is the simplest form one can think
of. In the following, we will show that such a simple model is sufficient for the secondary
users to achieve optimal transmissions. The critical question is how to choose the parameters
$\{\bar{c}_i(s_i, a_i), \bar{\pi}_i(s_i, a_i), \omega_i^{s_i, a_i}\}$ to achieve the optimal strategies $\pi_i^*$. Consider the following
setting of the parameters in Equation (8). It is easy to verify that, if the reference points are
$\bar{c}_i(s_i, a_i) = \prod_{j=1, j \ne i}^{N} \pi_j^*(s_j, a_j)$ and
$\bar{\pi}_i(s_i, a_i) = \pi_i^*(s_i, a_i)$, we have $c_i^*(s_i, a_i) = \prod_{j=1, j \ne i}^{N} \pi_j^*(s_j, a_j)$. Therefore, such a configuration
of the conjectures $c_i^*$ and the strategies $\pi_i^*$ achieves the optimal transmission. In non-cooperative
learning scenarios, SUs learn by modifying their conjectures based on new observations.
Specifically, we first allow the SUs to revise their reference points based on their past local
observations. We propose a simple rule for the SUs to update their reference points. In time slot
$t$, SU $i$ sets $\bar{c}_i(s_i, a_i)$ and $\bar{\pi}_i(s_i, a_i)$ to be $c_i^{t-1}(s_i, a_i)$ and $\pi_i^{t-1}(s_i, a_i)$. That is, Equation (8)
becomes

\[ c_i^t(s_i, a_i) = c_i^{t-1}(s_i, a_i) - \omega_i^{s_i, a_i} \left[ \pi_i^t(s_i, a_i) - \pi_i^{t-1}(s_i, a_i) \right], \tag{9} \]

for $t \in \mathbb{N}$.
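
The two conjecture computations described in this subsection, the estimate from the observed reward history and the recursive update of Equation (9), can be sketched as below. The function names are hypothetical, and the numbers in the example are arbitrary illustrative values.

```python
def estimate_conjecture(mean_gap, own_prob):
    """Estimate c = 1 / ((1 + n_bar) * pi), where n_bar is the mean number of
    slots between two occurrences of the same reward and pi is the SU's own
    probability of choosing the action."""
    if own_prob <= 0.0:
        raise ValueError("own_prob must be positive")
    return 1.0 / ((1.0 + mean_gap) * own_prob)

def update_conjecture(c_prev, pi_new, pi_prev, omega):
    """Equation (9): c^t = c^{t-1} - omega * (pi^t - pi^{t-1})."""
    return c_prev - omega * (pi_new - pi_prev)

# Hypothetical numbers: the same reward recurs every 3 slots on average,
# and the SU plays the action with probability 0.5.
c_hat = estimate_conjecture(mean_gap=3.0, own_prob=0.5)

# The SU then raises the probability of that action from 0.5 to 0.6.
c_next = update_conjecture(c_hat, pi_new=0.6, pi_prev=0.5, omega=0.2)
```

Under the linear model, increasing the probability of an action lowers the conjectured probability that the other SUs' joint actions support it, with the scalar omega controlling the strength of this belief.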

B. Conjecture based Q-value Updating 

Eventually, the multi-agent Q-leaming updating rule in Equation (7) is modified as following, 

Ql+^{si,ai) = {1 - a^)Ql{si,ai) + \ ^{si,ai)TZi{si,ai,a_i) + PmaxQl{s'i,bi) \ . (10) 

bieAi j 

The SU i updates its Q-values only with its own information using Equation (10) during the 
stochastic leaming process. To avoid observing the other SUs' private strategy information, the 
SU i conjectures about how its competitors' strategy decisions vary in response to its own 
actions. 
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The purpose of stochastic power allocation is to improve performance by explicitly balancing 
two competing objectives: 1) searching for better transmission power level (exploration) and 
2) gathering as much reward as possible (exploitation), such that the SU not only reinforces 
the evaluation of the power level it already knows to be good but also explores new one. 
Though e-greedy selection [jTJ is an efficient method of balancing exploration and exploitation 
in reinforcement learning. One drawback is that it chooses equally among all available actions 
when it explores. This implies that the worst action is as likely to be chosen as the best one. 

An alternative solution is to vary the action probabilities as a graded function of the Q-value.
The greedy action is given the highest selection probability, but all the others are ranked
and weighted according to their Q-values. The most common method is to use a Boltzmann
distribution. The SU $i$ chooses action $a_i$ in state $s_i$ at time step $t$ with probability [20],

$$\pi_i^{t}(s_i,a_i) = \frac{e^{Q_i^{t}(s_i,a_i)/\tau}}{\sum_{b \in \mathcal{A}_i} e^{Q_i^{t}(s_i,b)/\tau}}, \tag{11}$$

where $\tau$ is a positive parameter called the temperature. High temperatures cause the action
probabilities to be all nearly equal. Low temperatures cause a large difference in selection
probabilities for actions that differ in their Q-values.
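
The effect of the temperature can be seen in a few lines of Python. This is a minimal sketch of Boltzmann action selection for a single state, not the authors' implementation; the function names `boltzmann_policy` and `choose_action` and the three Q-values are hypothetical, chosen only for illustration.

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Return Boltzmann (softmax) action probabilities for one state."""
    q = np.asarray(q_values, dtype=float)
    # Subtracting the maximum leaves the probabilities unchanged
    # but keeps the exponentials from overflowing.
    z = (q - q.max()) / tau
    weights = np.exp(z)
    return weights / weights.sum()

def choose_action(q_values, tau, rng):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_policy(q_values, tau)
    return int(rng.choice(len(probs), p=probs))

q = [1.0, 2.0, 4.0]                    # hypothetical Q-values for three power levels
hot = boltzmann_policy(q, tau=100.0)   # high temperature: nearly uniform
cold = boltzmann_policy(q, tau=0.1)    # low temperature: nearly greedy
```

With the high temperature the three probabilities differ by about one percentage point, while with the low temperature almost all of the probability mass sits on the action with the largest Q-value.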

Now, the steps concerning power allocation corresponding to the conjecture-based multi-agent
Q-learning algorithm are summarized as follows:

Algorithm: Conjecture based Multi-agent Q-learning Algorithm for SU $i$

Initialization:
  Let $t = 0$.
  For each $s_i$, $a_i$ Do
    Initialize strategy $\pi_i^{t}(s_i,a_i)$, Q-values $Q_i^{t}(s_i,a_i)$, and the parameter $\omega_i^{s_i,a_i} > 0$.
  End For
  Evaluate the initial state $s_i = s_i^{t}$.
Learning:
  Loop
    (1) Choose action $a_i$ according to $\pi_i^{t}(s_i)$.
    (2) Measure the SINR $\gamma_i$ with the feedback information of the intended secondary
        receiver. Construct the current environment state $s_i' = s_i^{t+1}$ by identifying the
        transmission power level and comparing $\gamma_i$ with the threshold $\gamma_i^{*}$.
    (3) If $\gamma_i > \gamma_i^{*}$, then a reward $\mathcal{R}_i(s_i,a_i,a_{-i})$ can be achieved; otherwise, the receiver
        cannot receive correctly and the SU obtains zero reward.
    (4) Update $Q_i^{t+1}(s_i,a_i)$ based on $c_i^{t}(s_i,a_i)$ according to
        $Q_i^{t+1}(s_i,a_i) = (1-\alpha^{t})\,Q_i^{t}(s_i,a_i) + \alpha^{t}\left\{ c_i^{t}(s_i,a_i)\,\mathcal{R}_i(s_i,a_i,a_{-i}) + \beta \max_{b_i \in \mathcal{A}_i} Q_i^{t}(s_i',b_i) \right\}$.
    (5) Update the strategy $\pi_i^{t+1}(s_i,a_i) = e^{Q_i^{t+1}(s_i,a_i)/\tau} \Big/ \sum_{b \in \mathcal{A}_i} e^{Q_i^{t+1}(s_i,b)/\tau}$ for all $a_i \in \mathcal{A}_i$.
    (6) Update the conjecture $c_i^{t+1}(s_i,a_i) = c_i^{t}(s_i,a_i) - \omega_i^{s_i,a_i}\left[\pi_i^{t+1}(s_i,a_i) - \pi_i^{t}(s_i,a_i)\right]$.
    (7) $s_i = s_i'$.
  End Loop
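
The learning loop can be sketched in Python for a single SU. This is a minimal sketch under stated assumptions rather than the simulation code of this paper: the environment `toy_environment`, its success probabilities, the reward of 1/power, the initial conjecture of 1, the learning-rate schedule and all default parameter values are hypothetical stand-ins, and steps (5) and (6) are applied to every action of the visited state.

```python
import numpy as np

def boltzmann(q_row, tau):
    """Boltzmann probabilities over the actions of one state, Equation (11)."""
    z = (q_row - q_row.max()) / tau
    w = np.exp(z)
    return w / w.sum()

def toy_environment(action, rng):
    """Hypothetical stand-in for steps (2)-(3): returns (next_state, reward)."""
    powers = np.array([0.1, 0.2])        # W, i.e. the 100 mW and 200 mW levels
    success_prob = np.array([0.6, 0.9])  # assumed chance that the SINR exceeds the threshold
    ok = rng.random() < success_prob[action]
    reward = 1.0 / powers[action] if ok else 0.0  # assumed energy-efficiency style reward
    next_state = 2 * action + int(ok)    # state = (power level, SINR indicator)
    return next_state, reward

def run(num_slots=5000, tau=5.0, beta=0.9, omega=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = 4, 2
    Q = np.zeros((n_states, n_actions))
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    c = np.ones((n_states, n_actions))   # initial conjectures (assumed value)
    s = 0
    for t in range(num_slots):
        alpha = 1.0 / (1.0 + t) ** 0.6   # assumed decaying learning rate
        a = rng.choice(n_actions, p=pi[s])             # step (1)
        s_next, r = toy_environment(a, rng)            # steps (2)-(3)
        # Step (4): conjecture-weighted Q-value update, Equation (10).
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (c[s, a] * r + beta * Q[s_next].max())
        new_pi = boltzmann(Q[s], tau)                  # step (5)
        c[s] = c[s] - omega * (new_pi - pi[s])         # step (6)
        pi[s] = new_pi
        s = s_next                                     # step (7)
    return Q, pi, c
```

Calling `Q, pi, c = run()` returns the learned Q-values, strategies and conjectures. Because step (6) only accumulates strategy changes, each conjecture in this sketch equals its initial value minus `omega` times the total change of the corresponding strategy entry.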



Next, we are interested in the convergence of this algorithm. Our proof relies on the following
lemma by Szepesvari and Littman [1211, which establishes the convergence of a general Q-learning
process updated by a pseudo-contraction operator. Let $\mathcal{Q}$ be the space of all Q-values.

Lemma: Assume that $\alpha^{t}$ in Equation (10) satisfies the sufficient conditions of Theorem in [23],
and the mapping $\mathcal{H}^{t} : \mathcal{Q} \to \mathcal{Q}$ meets the following condition: there exist a number $0 < \lambda < 1$
and a sequence $\xi^{t} \geq 0$ converging to zero w.p. 1, such that $\|\mathcal{H}^{t} Q^{t} - \mathcal{H}^{t} Q^{*}\| \leq \lambda \|Q^{t} - Q^{*}\| + \xi^{t}$
for all $Q^{t} \in \mathcal{Q}$ and $Q^{*} = E[\mathcal{H}^{t} Q^{*}]$. Then the iteration defined by

$$Q^{t+1} = (1-\alpha^{t})\,Q^{t} + \alpha^{t}\,(\mathcal{H}^{t} Q^{t})$$

converges to $Q^{*}$ w.p. 1.
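
The averaged iteration in the Lemma can be illustrated numerically. The sketch below is not the operator of this paper: it uses a hypothetical noisy linear operator, `r + beta * P @ q + noise`, which is a contraction with modulus `beta` in the max norm and whose mean operator has a fixed point that can be computed directly. The rewards `r`, the matrix `P`, the noise level and the step-size schedule are all assumptions.

```python
import numpy as np

def averaged_iteration(num_steps=20000, beta=0.9, seed=1):
    rng = np.random.default_rng(seed)
    r = np.array([1.0, 2.0, 0.5])        # hypothetical mean rewards
    P = np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])      # hypothetical row-stochastic matrix
    # Fixed point of the mean operator: q_star = r + beta * P @ q_star.
    q_star = np.linalg.solve(np.eye(3) - beta * P, r)
    q = np.zeros(3)
    errors = []
    for t in range(num_steps):
        # Step sizes whose sum diverges while the sum of squares stays finite.
        alpha = 1.0 / (t + 1) ** 0.7
        noise = rng.normal(0.0, 0.5, size=3)   # zero-mean sampling noise
        hq = r + beta * P @ q + noise          # noisy contraction applied to q
        q = (1 - alpha) * q + alpha * hq       # averaged update from the Lemma
        errors.append(float(np.max(np.abs(q - q_star))))
    return q, q_star, errors
```

Plotting `errors` from `averaged_iteration()` shows the max-norm distance to the fixed point shrinking as the step size decays, even though every single application of the operator is noisy.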

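For an $N$-player stochastic game, we define the operator $\mathcal{H}^{t}$ as follows.

Definition 3: Let $Q^{t} = (Q_1^{t}, \ldots, Q_N^{t})$, where $Q_i^{t} \in \mathcal{Q}_i$ for $i = 1, \ldots, N$, and $\mathcal{Q} = \mathcal{Q}_1 \times \cdots \times \mathcal{Q}_N$.
$\mathcal{H}^{t} : \mathcal{Q} \to \mathcal{Q}$ is a mapping on the complete metric space $\mathcal{Q}$ into $\mathcal{Q}$, $\mathcal{H}^{t} Q^{t} = (\mathcal{H}^{t} Q_1^{t}, \ldots, \mathcal{H}^{t} Q_N^{t})$,
where

$$\mathcal{H}^{t} Q_i^{t}(s_i,a_i) = c_i^{t}(s_i,a_i)\,\mathcal{R}_i(s_i,a_i,a_{-i}) + \beta \max_{b_i \in \mathcal{A}_i} Q_i^{t}(s_i',b_i).$$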
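Then we proceed to prove that $Q^{*} = E[\mathcal{H}^{t} Q^{*}]$.

Proposition 2: For an $N$-player stochastic game, $Q^{*} = E[\mathcal{H}^{t} Q^{*}]$, where $Q^{*} = (Q_1^{*}, \ldots, Q_N^{*})$.
Proof: We have

$$Q_i^{*}(s_i,a_i) = E\left[\mathcal{R}_i\big(s_i,a_i,\pi_{-i}^{*}(s_{-i})\big)\right] + \beta \sum_{s_i'} T_{s_i s_i'}\big(a_i,\pi_{-i}^{*}(s_{-i})\big) \max_{b_i \in \mathcal{A}_i} Q_i^{*}(s_i',b_i).$$

From Equation (9), $c_i^{*}(s_i,a_i) = \prod_{j=1, j \neq i}^{N} \pi_j^{*}(s_j,a_j)$. Thus,

$$Q_i^{*}(s_i,a_i) = E\left[\mathcal{H}^{t} Q_i^{*}(s_i,a_i)\right],$$

for all $s_i$ and $a_i$. ■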

We further define the distance between two Q-values.
Definition 4: For any $Q, Q' \in \mathcal{Q}$, we define

$$\|Q - Q'\| = \max_{i} \max_{s_i} \max_{a_i} \left|Q_i(s_i,a_i) - Q_i'(s_i,a_i)\right|.$$

Proposition 3: $\mathcal{H}^{t}$ is a contraction mapping operator.
Proof: The proof is given in Appendix.

We can now present the main result of this paper: the learning process induced by
Algorithm converges.

Theorem: Regardless of the initial values chosen for $Q_i^{0}(s_i,a_i)$, if $\tau$ is sufficiently large,
Algorithm converges.

Proof: The proof is a direct application of Lemma, which establishes convergence given
two conditions. First, $\mathcal{H}^{t}$ is a contraction mapping operator, by Proposition 3. Second, the fixed
point condition, $Q^{*} = E[\mathcal{H}^{t} Q^{*}]$, is ensured by Proposition 2. Therefore, the learning process
expressed by Equation (10) converges.

V. Numerical Results 

To demonstrate the performance of the proposed conjecture based multi-agent Q-learning
algorithm, we present simulation experiments of a hybrid CogMesh consisting of one PU network
and one CR network. Users in CogMesh are uniformly distributed over a 300 m x 300 m square
area, and share the same frequency band with a bandwidth of 1 MHz. Links can
communicate directly if the distance between a transmitter and the corresponding receiver is
no more than 30 m. Time is divided into slots, each of length 10 ms. During each time slot,
each PU attempts to transmit with a probability of $\kappa$, the PU's behavior factor. It is assumed that
the PUs have only one transmission power level of 200mW, the AWGN power a — 10~^mW, 
and r = 1. Also, we set the power mask to be 200mW for all SUs. The link gains used in this 
paper are given by 

$$\text{link gain} = X \left(\frac{d_0}{d}\right)^{n} F, \quad \text{for } d > d_0,$$

where X is a constant set to be 10~^, the shadowing factor F is a random number and is 
independent and identically generated from a lognormal distribution with a mean of 0 dB and
variance 6 dB, $d$ is the physical distance between transmitter and receiver, $d_0$ is the reference
distance, and $n$ is the path loss exponent. Throughout the simulations, we set $d_0 = 1$ and
$n = 4$. All simulated curves in this paper are averaged over 200
episodes.
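
A link gain of this kind can be sampled as follows. This is a minimal sketch that assumes the usual simplified path-loss form, a constant times $(d_0/d)^n$ times a lognormal shadowing factor. The function name `link_gain` is hypothetical, `const=1e-3` is a placeholder rather than the value used in the simulations, and the 6 dB figure is treated here as the standard deviation of the shadowing in dB, which is also an assumption.

```python
import numpy as np

def link_gain(d, rng, const=1e-3, d0=1.0, n=4.0, shadow_sigma_db=6.0):
    """Sample const * (d0 / d)**n * F, with F lognormal around 0 dB."""
    d = np.asarray(d, dtype=float)
    if np.any(d <= d0):
        raise ValueError("the model is stated only for d > d0")
    # Shadowing is Gaussian in dB, hence lognormal in linear scale.
    shadow_db = rng.normal(0.0, shadow_sigma_db, size=d.shape)
    shadowing = 10.0 ** (shadow_db / 10.0)
    return const * (d0 / d) ** n * shadowing

rng = np.random.default_rng(0)
distances = np.array([10.0, 20.0, 30.0])   # metres, within the 30 m link range
gains = link_gain(distances, rng)
```

Setting `shadow_sigma_db=0.0` removes the randomness and leaves only the deterministic distance-dependent part, which is convenient for checking the path-loss exponent.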

As for the proposed conjecture based multi-agent Q-learning algorithm, it is implemented by
each SU with a discount factor $\beta = 0.9$. We use the following learning rate

where $\alpha^{0} \in [0, 1)$ is the initial learning rate, and 6* > 1 is a scalar. Like any other learning
scheme, the SUs need a learning phase to learn the optimal transmission strategies under the
assumption that each SU can perfectly conjecture the probability $\prod_{j=1, j \neq i}^{N} \pi_j(s_j,a_j)$ during each
time slot. However, once the strategies are acquired, the SUs take only one iteration to reach
the optimal energy-efficient transmission configuration when starting from any initial environment
states $s_i$ ($i = 1, \ldots, N$). The major concern for our proposed algorithm is the convergence speed
of the stochastic learning dynamics. We first simulate a relatively simple networking scenario
consisting of three pairs of SU links coexisting with three pairs of PU links with a behavior
factor $\kappa = 0.5$. The SUs have two transmission power levels {100 mW, 200 mW}. That is, in the
proposed algorithm, im — 1 and J\f — {1,2}. 

Without loss of generality, we take SU 1 as an example. Fig. 3 and Fig. 4 show the simulation
results for different $\alpha^{0}$ and $\tau$, which indicate that the proposed algorithm converges. We can also
see from Fig. 3 that a larger $\tau$ results in a worse expected reward. This is because exploration
lasts for a longer time even if the best power level achieving optimal transmission was already
visited. Thus, during the learning process, the SU should set a sufficiently large temperature to
balance the tradeoff between exploration and exploitation, or has to adjust it dynamically. The
curves in Fig. 4 illustrate that when $\tau$ is small, the convergence performance is worse for
smaller $\alpha^{0}$. Since the Q-values then converge slowly, the exploration phase still dominates the
learning procedure, which may reduce the opportunities of achieving the optimal transmission
configuration on average. Overall, the performance of our proposed algorithm is good when
a suitable learning rate $\alpha^{t}$ is chosen. If the algorithm is deployed by the SUs in a CogMesh
environment, $\alpha^{0}$ has to be chosen in advance.

Next, for a more general case, we consider a CR network consisting of six SUs co-located
with five PUs. The PUs attempt to transmit with a probability $\kappa = 1$. Each SU has multiple
transmission power levels. The discrete transmission power levels used by the SUs range
from 100 mW to 200 mW, equally spaced by 20 mW. We compare the expected rewards of the SUs
achieved by the proposed algorithm with the system's optimum $\mathcal{R}_i^{opt} = \max_{\mathbf{p}} \mathcal{R}_i(\mathbf{p})$ in Fig. 5.
It can be seen from the graph that the achieved performance is close to the optimum and the
performance loss is no more than 25% on average.

Fig. 6 depicts the expected rewards of the six secondary users versus the PU's behavior factor
$\kappa$ under the same networking environment assumptions as in Fig. 5. As expected, a higher
$\kappa$ results in higher interference caused by the PUs to the SUs, so the expected rewards are
degraded.

VI. Conclusion 

In this paper, we have studied the non-cooperative power allocation problem
in CogMesh, which is modeled as a stochastic learning process. We extend the single-agent
Q-learning algorithm to a multi-user context. Due to the non-cooperation among the local
clusters, a conjecture based multi-agent Q-learning approach is proposed to reach the optimal
transmission strategies with only private and incomplete information. The learning SU performs
Q-function updating based on the conjecture about other SUs' behaviors over the current
Q-values. This learning algorithm provably converges given certain restrictions that arise during
the learning procedure, and the simulations demonstrate the effectiveness of the algorithm in improving
energy efficiency. The results in this paper provide a new approach to designing protocols
for non-cooperative CR networks.

Appendix



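Proof of Proposition 3.
Proof:

$$\begin{aligned}
\|\mathcal{H}^{t} Q - \mathcal{H}^{t} Q'\|
&= \max_{i} \max_{s_i} \max_{a_i} \left|\mathcal{H}^{t} Q_i(s_i,a_i) - \mathcal{H}^{t} Q_i'(s_i,a_i)\right| \\
&= \max_{i} \max_{s_i} \max_{a_i} \left| \left[c_i(s_i,a_i) - c_i'(s_i,a_i)\right] \mathcal{R}_i(s_i,a_i,a_{-i}) + \beta \left[ \max_{b_i \in \mathcal{A}_i} Q_i(s_i',b_i) - \max_{b_i \in \mathcal{A}_i} Q_i'(s_i',b_i) \right] \right| \\
&\leq \max_{i} \max_{s_i} \max_{a_i} \left| \left[c_i(s_i,a_i) - c_i'(s_i,a_i)\right] \mathcal{R}_i(s_i,a_i,a_{-i}) \right| + \max_{i} \max_{s_i'} \beta \left| \max_{b_i \in \mathcal{A}_i} Q_i(s_i',b_i) - \max_{b_i \in \mathcal{A}_i} Q_i'(s_i',b_i) \right| \\
&\leq \max_{i} \max_{s_i} \max_{a_i} \left| \left[c_i(s_i,a_i) - c_i'(s_i,a_i)\right] \mathcal{R}_i(s_i,a_i,a_{-i}) \right| + \beta \|Q - Q'\| .
\end{aligned}$$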

We discuss the first term $[c_i(s_i,a_i) - c_i'(s_i,a_i)]\,\mathcal{R}_i(s_i,a_i,a_{-i})$ in the last inequality above. Since
the reference points are exogenously given and of common knowledge, we
have

$$\left[c_i(s_i,a_i) - c_i'(s_i,a_i)\right] \mathcal{R}_i(s_i,a_i,a_{-i}) = -\omega_i^{s_i,a_i} \left[\pi_i(s_i,a_i) - \pi_i'(s_i,a_i)\right] \mathcal{R}_i(s_i,a_i,a_{-i}). \tag{A-1}$$

We first concentrate on the term $\pi_i(s_i,a_i)$ in Equation (A-1). By applying Equation (11), we
have

$$\pi_i(s_i,a_i) = \frac{e^{Q_i(s_i,a_i)/\tau}}{\sum_{b \in \mathcal{A}_i} e^{Q_i(s_i,b)/\tau}}.$$



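When $\tau$ is sufficiently large, we get

$$e^{Q_i(s_i,a_i)/\tau} = 1 + \frac{Q_i(s_i,a_i)}{\tau} + \varrho\!\left(\frac{Q_i(s_i,a_i)}{\tau}\right),$$

where $\varrho\!\left(\frac{Q_i(s_i,a_i)}{\tau}\right)$ is a polynomial of order $O\!\left(\left(\frac{Q_i(s_i,a_i)}{\tau}\right)^{2}\right)$. It is easy to verify that

$$\pi_i(s_i,a_i) = \frac{1}{m_i+1} + \frac{Q_i(s_i,a_i)}{(m_i+1)\tau} + \varrho\big(\{Q_i(s_i,b)\}_{b \in \mathcal{A}_i}\big), \tag{A-2}$$

where $\varrho\big(\{Q_i(s_i,b)\}_{b \in \mathcal{A}_i}\big)$ is the polynomial of smaller order than $O\!\left(\frac{1}{\tau}\right)$. Note that the
coefficient of the polynomial is independent of the Q-value. Similarly,

$$\pi_i'(s_i,a_i) = \frac{1}{m_i+1} + \frac{Q_i'(s_i,a_i)}{(m_i+1)\tau} + \varrho\big(\{Q_i'(s_i,b)\}_{b \in \mathcal{A}_i}\big). \tag{A-3}$$

Substituting Equations (A-2) and (A-3) into Equation (A-1) establishes

$$\left[c_i(s_i,a_i) - c_i'(s_i,a_i)\right] \mathcal{R}_i(s_i,a_i,a_{-i}) = -\omega_i^{s_i,a_i}\,\mathcal{R}_i(s_i,a_i,a_{-i}) \left[ \frac{Q_i(s_i,a_i)}{(m_i+1)\tau} - \frac{Q_i'(s_i,a_i)}{(m_i+1)\tau} + \varrho\big(\{Q_i(s_i,b)\}_{b \in \mathcal{A}_i}\big) - \varrho\big(\{Q_i'(s_i,b)\}_{b \in \mathcal{A}_i}\big) \right].$$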



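That is, we can always take a sufficiently large $\tau$ such that

$$\left| \left[c_i(s_i,a_i) - c_i'(s_i,a_i)\right] \mathcal{R}_i(s_i,a_i,a_{-i}) \right| < \frac{1-\beta}{m_i+1} \left| Q_i(s_i,a_i) - Q_i'(s_i,a_i) \right|,$$

which implies

$$\|\mathcal{H}^{t} Q - \mathcal{H}^{t} Q'\| \leq \max_{i} \max_{s_i} \max_{a_i} \frac{1-\beta}{m_i+1} \left| Q_i(s_i,a_i) - Q_i'(s_i,a_i) \right| + \beta \|Q - Q'\| .$$

Therefore, $\mathcal{H}^{t}$ is a contraction mapping operator. This concludes the proof.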
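
The role of a large temperature in this argument can be checked numerically. The sketch below does not reproduce the expansion used in the proof. It relies on a standard property of the softmax, namely that each Boltzmann probability changes by at most $\|Q - Q'\|/(2\tau)$ in the max norm, so by Equation (A-1) the difference between two conjectures shrinks in proportion to $1/\tau$. The Q-vectors are random, hypothetical values for six power levels.

```python
import numpy as np

def boltzmann(q, tau):
    """Boltzmann probabilities for one state, Equation (11)."""
    z = (q - q.max()) / tau
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
q1 = rng.uniform(0.0, 10.0, size=6)         # hypothetical Q-values, six power levels
q2 = q1 + rng.uniform(-1.0, 1.0, size=6)    # a second, perturbed Q-vector

def policy_gap(tau):
    """Largest difference between the two Boltzmann strategies."""
    return float(np.max(np.abs(boltzmann(q1, tau) - boltzmann(q2, tau))))

gap_norm = float(np.max(np.abs(q1 - q2)))   # max-norm distance between the Q-vectors
gaps = {tau: policy_gap(tau) for tau in (1.0, 10.0, 100.0, 1000.0)}
# Distance of the strategy from the uniform one at a very high temperature.
uniform_gap = float(np.max(np.abs(boltzmann(q1, 1000.0) - 1.0 / 6.0)))
```

Inspecting `gaps` shows the strategy difference staying below `gap_norm / (2 * tau)` for every temperature, and `uniform_gap` shows that at high temperature the strategy is close to assigning probability 1/6 to each of the six levels.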
Fig. 2. Cognitive wireless mesh networking (CogMesh) scenarios.
Fig. 3. Performance, when $\kappa = 0.5$: Impact of the temperature $\tau$ on the expected rewards achieved by SU 1.
Fig. 4. Performance, when $\kappa = 0.5$: Impact of the learning rate $\alpha^{0}$ on the expected rewards achieved by SU 1.
Fig. 5. Performance comparison between the proposed algorithm and the system's optimum.
Fig. 6. The expected rewards of the SUs versus the PU's behavior factor $\kappa$.



